General Information: This dataset contains chemicals that are added to wine and that affect it positively or negatively, so our goal is to find out which chemical has the most influence on the quality of the red wine.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The total number of wines is 1599; the maximum quality is 8/10 and the minimum is 3/10. All of them contain sugar and chlorides (salt), because the minimum of both is not zero. The pH looks great too: it runs from about 2.7 to 4, which is a great pH range! ^^
Calling table(rw) on the whole data frame fails because the resulting table is too large: ‘a table with >= 2^31 elements’.
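The error comes from `table()` trying to cross-tabulate every column of the data frame at once. A minimal sketch of the usual workaround is to tabulate a single column. The vector below holds only the first ten `quality` values as printed in the `str()` output of this report, so the counts are for illustration only; with the full data you would pass `rw$quality`.

```r
# First ten values of `quality`, copied from the str() output in this report.
# With the full data frame, use rw$quality instead of this short vector.
quality_head <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# table() on one column counts how many wines received each score
quality_counts <- table(quality_head)
print(quality_counts)

# The same counts expressed as proportions of all wines in the vector
print(prop.table(quality_counts))
```

Tabulating one column keeps the result small: one cell per distinct quality score.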
First, we want to see where most of our wines’ quality lies, so we can see how good the wines are in general. The following plot is used to check the number of red wines at each quality level. Well, it’s not that great: most are between 5 and 7, but we’ll see the reasons behind that :)
I was curious about the distribution of the chemical properties and which had the most effect on the quality of the wine, so I decided to plot them using histograms.
Based on the above plot, we can see that the distribution is almost normal and that most of our wines contain less than 12 ‘fixed.acidity’. Maybe we need more to have a better quality; we’ll see.
Again, a close-to-normal distribution. We can also see that almost all wines contain less than 1.2 ‘volatile.acidity’; we’ll see its effect on the quality later on.
Now let’s continue and plot the rest so we can see how they are distributed.
Wow, most of our plots so far are close to a normal distribution! Also, we can see that most wines contain <= 0.50 ‘citric.acid’.
This one is different: a right-skewed plot. More interesting still, most wines have less than 4 ‘residual.sugar’; we’ll understand its effect on the quality later on.
It looks like a normal distribution, but the wines seem to have very low chlorides. Hmm, we’ll see its effect later on and understand it more :)
Low pH; maybe that’s the reason behind the non-perfect quality. Hmm, we’ll dig deeper later on.
Right skewed, but we can see that most of our wines have lower alcohol. Maybe this is the reason behind the quality ranking, hmm.
So we have seen a lot of right-skewed plots; maybe they are the reason behind not having any wines of near-perfect quality (9 or 10). But let us see.
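One rough way to back up the visual impression of skew is to compare each variable's mean with its median: a mean noticeably above the median hints at a longer right tail. The sketch below uses only the mean and median values printed in the `summary()` output at the top of this report; the `rel_gap` column is a name made up here for illustration, and this is a heuristic, not a formal skewness test.

```r
# Mean and median values copied from the summary() output in this report
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "pH", "alcohol"),
  mean     = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 3.311, 10.42),
  median   = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 3.310, 10.20)
)

# Relative gap between mean and median (hypothetical helper column):
# the larger the positive gap, the stronger the hint of a right tail
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Sort so the most right-leaning variables come first
stats <- stats[order(-stats$rel_gap), ]
print(stats, row.names = FALSE)
```

By this measure residual sugar and chlorides lean furthest to the right, while pH is almost perfectly balanced around its median.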
Now we’ll get the correlations between the different variables and quality.
##
## Pearson's product-moment correlation
##
## data: rw$fixed.acidity and rw$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: rw$residual.sugar and rw$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: rw$chlorides and rw$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: rw$alcohol and rw$alcohol
## t = 1896300000, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
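Note that the last test above compares `rw$alcohol` with itself, which is why it reports a correlation of exactly 1; the intended comparison was presumably alcohol against quality. A minimal sketch of running `cor.test()` in a loop over all predictors, which avoids this kind of copy-paste slip, follows. It uses only the first ten rows printed in the `str()` output of this report, so the numbers it prints will not match the full-data results; with the real data you would loop over the columns of `rw`.

```r
# First ten rows of a few columns, copied from the str() output in this report.
# This tiny sample is for illustration only; use the full `rw` in practice.
rw_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Every column except the outcome is treated as a predictor
predictors <- setdiff(names(rw_head), "quality")

# Run cor.test() for each predictor against quality and keep the key numbers
cor_results <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(rw_head[[v]], rw_head$quality)
  data.frame(variable = v,
             cor      = unname(ct$estimate),
             p.value  = ct$p.value)
}))
print(cor_results)
```

Because the second argument is always `quality`, no variable can end up being tested against itself.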
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset has 1599 observations and 13 variables.
The main feature of interest is quality, which is scored between 0 and 10.
Volatile acidity, citric acid, residual sugar, and chlorides will be the best predictors. All of those seem to have to do with taste.
No, there was no need to create new variables.
The plot is noisy because of the limited scale and the large number of data points.
We have already calculated the correlation between quality and the different variables. Now we want to take a closer look at them.
It looks like as volatile acidity increases, quality decreases, although there are two observations worth mentioning:
Wines with a quality score of seven and eight (the best of the dataset) have similar median volatile acidity. However, the volatile acidity of wines with a quality of seven is more dispersed.
Outliers with a quality of seven and eight have a volatile acidity similar to the median of the worst rated wines: volatile acidity alone cannot explain the differences in quality.
These findings agree with the information provided by the authors of the dataset: “too high levels can lead to an unpleasant, vinegar taste”.
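The comparison of medians across quality scores can also be done numerically. Here is a minimal sketch with `tapply()`, using only the first ten rows printed in the `str()` output of this report; with so few rows the medians are purely illustrative, and with the full data you would pass `rw$volatile.acidity` and `rw$quality`.

```r
# First ten rows of two columns, copied from the str() output in this report
volatile_head <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality_head  <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median volatile acidity within each quality score
# (the same statistic a box plot draws as its middle line)
medians <- tapply(volatile_head, quality_head, median)
print(medians)

# Interquartile range per score, as a simple measure of dispersion
spreads <- tapply(volatile_head, quality_head, IQR)
print(spreads)
```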
The variables ‘quality’ and ‘citric.acid’ are positively correlated. However, wines with a quality score of seven or eight present very similar levels of citric acid. For the rest, the amount of citric acid is very dispersed, although the median citric acid quantity for low quality wines is very low.
It looks like the higher the alcohol content of a wine, the better the score it receives, but this effect only appears in wines with a quality of six or more; the rest have similar median values. There are a lot of outliers with a high percentage of alcohol among the wines of quality five.
The amount of sulphates is slightly positively correlated with the quality of the wine, but the effect is not as pronounced as with the other variables mentioned above. There are a lot of outliers.
There seems to be a mild negative correlation between ‘density’ and ‘quality’. I doubt the experts can detect such small variations in density between different wines, or even care about it. My guess is that this is due to ‘density’ being correlated with other influential variables, like ‘alcohol’, or just pure randomness.
This is similar to the last case. Can we detect with our sense of taste differences in pH of at most one unit? Maybe this is caused by the existing negative correlation between ‘pH’ and ‘citric.acid’.
Both ‘volatile.acidity’ and ‘pH’ are negatively correlated with ‘citric.acid’ (-0.552 and -0.542, respectively). The latter makes sense: low pH values indicate acidity.
The relation between acetic acid (volatile.acidity) and citric acid is not that clear.
High levels of alcohol are associated with low density (-0.496), which makes sense, since alcohol is less dense than water.
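Pairwise correlations like these can be read off a single correlation matrix instead of being computed one pair at a time. A minimal sketch with `cor()` follows; it uses only the first ten rows printed in the `str()` output of this report (density is left out because only five of its values are shown there), so the coefficients it prints will differ from the full-data values quoted above.

```r
# First ten rows of four columns, copied from the str() output in this report.
# Illustration only: with the full data, pass the numeric columns of `rw`.
rw_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(rw_head, method = "pearson")
print(round(cor_matrix, 3))
```

Each off-diagonal cell holds the coefficient for one pair of variables, so the strongest positive and negative relationships can be spotted at a glance.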
I found it interesting that higher alcohol content had a higher probability of getting a good quality score. Also, sugar didn’t have much impact on the quality of the wine.
I noticed that density and alcohol had a stronger negative correlation than others.
The strongest relationship was between pH and fixed acidity.
Plot for Quality by Volatile Acidity and Alcohol
I tried to make the colors distinct here and I still can’t see a clear pattern. Maybe citric acid and alcohol together can predict quality?
There is a little bit of a pattern where the dots get redder up and to the right, but it really doesn’t look like much of a pattern. At this point I think picking the two variables with the highest correlation coefficients might reveal something.
Alcohol by Chlorides for Differing Quality Red Wines
Alcohol Content by Wine Quality with multi boxplots
The only relationship I really saw was in that last plot. You can tell that as the alcohol increases and the volatile acidity decreases, the quality increases.
Nope
From the above box plots, we can see the median alcohol content in each quality range, where the high-quality wines contain more alcohol (by volume).
The above scatter plot describes the amount of chlorides and alcohol in every red wine, and also shows how they affect the quality of the wine. We can see that the highest-quality wines contain less chlorides and more alcohol.
This dataset has 11 physicochemical properties of 1599 red wines.
For the univariate plots, I used a line plot to see the curve of the quality, which was very messy and hard to read, so I added geom_smooth to make it easier to observe. I also used cor.test to see the correlation between the chemicals (physicochemical properties) and quality.
For the bivariate plots, I used scatter plots to find the relationship between various variables. I also used a smoother with method ‘lm’ (linear model) to make them easier to read and understand.
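The straight line that a smoother with method ‘lm’ draws is an ordinary least-squares fit, so the same line can be obtained directly with `lm()`. Below is a minimal sketch regressing quality on alcohol; the choice of this pair is an assumption made for illustration, and the data are only the first ten rows printed in the `str()` output of this report, so the fitted coefficients say nothing about the full dataset.

```r
# First ten rows of two columns, copied from the str() output in this report
rw_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Least-squares line: quality as a linear function of alcohol
fit <- lm(quality ~ alcohol, data = rw_head)

# Intercept and slope of the fitted line
print(coef(fit))

# Predicted quality at a few alcohol levels along the line
new_wines <- data.frame(alcohol = c(9.5, 10.0, 10.5))
predicted <- predict(fit, newdata = new_wines)
print(predicted)
```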
For the multivariate plots, which were the hardest for me :( I also used jitter plots and mapped the color to different variables so I could read them more easily and make them more meaningful.
The struggle I faced was understanding wine and the different physicochemical properties involved in it. It was hard for me to choose which variables to graph because I found them hard to understand, and I’m not a wine drinker :)
The only surprise I found was that sugar did not have a high impact on wine quality, because normally in any food, a high amount of sugar makes it hard to eat and it becomes tasteless, but in this dataset it’s different.
Everything went well in this project after fully understanding the variables.
I think this is a small dataset with a limited number of observations. In the future, if it had something like 50k+ observations, we could more fully understand the different impacts on wine quality.
AND THAT’S IT!!! THANK YOU VERY MUCH FOR THIS INTERESTING PROJECT <3